vector distribution
Curator: Efficient Indexing for Multi-Tenant Vector Databases
Jin, Yicheng, Wu, Yongji, Hu, Wenjun, Maggs, Bruce M., Zhang, Xiao, Zhuo, Danyang
Vector databases have emerged as key enablers for bridging intelligent applications with unstructured data, providing generic search and management support for embedding vectors extracted from the raw unstructured data. As multiple data users can share the same database infrastructure, multi-tenancy support for vector databases is increasingly desirable. This hinges on an efficient filtered search operation, i.e., only querying the vectors accessible to a particular tenant. Multi-tenancy in vector databases is currently achieved by building either a single, shared index among all tenants, or a per-tenant index. The former optimizes for memory efficiency at the expense of search performance, while the latter does the opposite. Instead, this paper presents Curator, an in-memory vector index design tailored for multi-tenant queries that simultaneously achieves the two conflicting goals, low memory overhead and high performance for queries, vector insertion, and deletion. Curator indexes each tenant's vectors with a tenant-specific clustering tree and encodes these trees compactly as sub-trees of a shared clustering tree. Each tenant's clustering tree adapts dynamically to its unique vector distribution, while maintaining a low per-tenant memory footprint. Our evaluation, based on two widely used data sets, confirms that Curator delivers search performance on par with per-tenant indexing, while maintaining memory consumption at the same level as metadata filtering on a single, shared index.
Optimal transport for vector Gaussian mixture models
Zhu, Jiening, Xu, Kaiming, Tannenbaum, Allen
Finite mixture models can describe a wide range of statistical phenomena. They have been successfully applied to numerous fields including biology, economics, engineering, and social sciences [15]. The first major use and analysis of mixture models is perhaps due to the mathematician and biostatistician Karl Pearson over 120 years ago who explicitly decomposed a distribution into two normal distributions for characterizing non-normal attributes of forehead to body length ratios in female shore crab populations [16]. The literature on analyzing and applying mixture models is growing due to their simplicity, versatility and flexibility. One of the most commonly used mixture models is the Gaussian mixture model (GMM), which is a weighted sum of Gaussian distributions.
Learning via Gaussian Herding
We introduce a new family of online learning algorithms based upon constraining the velocity flow over a distribution of weight vectors. In particular, we show how to effectively herd a Gaussian weight vector distribution by trading off velocity constraints with a loss function. By uniformly bounding this loss function, we demonstrate how to solve the resulting optimization analytically. We compare the resulting algorithms on a variety of real world datasets, and demonstrate how these algorithms achieve state-of-the-art robust performance, especially with high label noise in the training data.